skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Tsotras, Vassilis"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available December 15, 2025
  2. This paper demonstrates Pynapple-G, an open-source library for scalable spatial grouping queries based on Apache Sedona (formerly known as GeoSpark). We demonstrate two modules, namely, SGPAC and DDCEL, that support grouping points, grouping lines, and polygon overlays. The SGPAC module provides a large-scale grouping of spatial points by highly complex polygon boundaries. The grouping results aggregate the number of spatial points within the boundaries of each polygon. The DDCEL module provides the first parallelized algorithm to group spatial lines into a DCEL data structure and discovers planar polygons from scattered line segments. Exploiting the scalable DCEL, we support scalable overlay operations over multiple polygon layers to compute the layers’ intersection, union, or difference. To showcase Pyneapple-G, we have developed a frontend web application that enables users to interact with these modules, select their data layers or data points, and view results on an interactive map. We also provide interactive notebooks demonstrating the superiority and simplicity of Pyneapple-G to help social scientists and developers explore its full potential. 
    more » « less
  3. This paper demonstratesPynapple-G, an open-source library for scalable spatial grouping queries based on Apache Sedona (formerly known as GeoSpark). We demonstrate two modules, namely,SGPACandDDCEL, that support grouping points, grouping lines, and polygon overlays. TheSGPACmodule provides a large-scale grouping of spatial points by highly complex polygon boundaries. The grouping results aggregate the number of spatial points within the boundaries of each polygon. TheDDCELmodule provides the first parallelized algorithm to group spatial lines into a DCEL data structure and discovers planar polygons from scattered line segments. Exploiting the scalable DCEL, we support scalable overlay operations over multiple polygon layers to compute the layers' intersection, union, or difference. To showcasePyneapple-G, we have developed a frontend web application that enables users to interact with these modules, select their data layers or data points, and view results on an interactive map. We also provide interactive notebooks demonstrating the superiority and simplicity ofPyneapple-Gto help social scientists and developers explore its full potential. 
    more » « less
  4. Join operations are crucial in data analysis, but can suffer inefficiency with large datasets and complex non- equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to realize new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code. 
    more » « less
  5. Join operations are crucial in data analysis, but can suffer inefficiency with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to realize new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code. 
    more » « less
  6. Abstract Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner. 
    more » « less
  7. This paper studies the spatial group-by query over complex polygons. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within the boundaries of each polygon. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query on highly skewed spatial data. We also propose an effective query optimizer that adaptively assigns the appropriate processing scheme based on the query polygons. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques. 
    more » « less
  8. ABSTRACT The Doubly Connected Edge List (DCEL) is an edge-list structure that has been widely utilized in spatial applications for planar topological computations. An important operation is the overlay which combines the DCELs of two input layers and can easily support spatial queries like the intersection, union and difference between these layers. However, existing sequential implementations for computing the overlay do not scale and fail to complete for large datasets (for example the US census tracks). In this paper we propose a distributed and scalable way to compute the overlay operation and its related supported queries. We address the issues involved in efficiently distributing the overlay operator and over various optimizations that improve performance. Our scalable solution can compute the overlay of very large real datasets (32M edges) in few minutes. 
    more » « less
  9. Abstract—The Doubly Connected Edge List (DCEL) is a popular data structure for representing planar subdivisions and is used to accelerate spatial applications like map overlay, graph simplification, and subdivision traversal. Current DCEL imple- mentations assume a standalone machine environment, which does not scale when processing the large dataset sizes that abound in today’s spatial applications. This paper proposes a Distributed Doubly Connected Edge List (DDCEL) data structure extending the DCEL to a distributed environment. The DDCEL constructor undergoes a two-phase paradigm to generate the subdivision’s vertices, half-edges, and faces. After spatially partitioning the input data, the first phase runs the sequential DCEL construction algorithm on each data partition in parallel. The second phase then iteratively merges information from multiple data parti- tions to generate the shared data structure. Our experimental evaluation with real data of road networks of up to 563 million line segments shows significant performance advantages of the proposed approach over the existing techniques. 
    more » « less
  10. Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. 
    more » « less